Problem Statement: Concrete Strength Prediction

Objective

To predict the concrete strength using the data available in file "concrete.csv". Apply feature engineering and model tuning to obtain a score above 85%.

Resources Available

The data for this project is available in file https://archive.ics.uci.edu/ml/machine-learning-databases/concrete/compressive/. The same has been shared along with the course content.

Attribute information

Given are the variable name, variable type, the measurement unit, and a brief description. The concrete compressive strength is the regression problem. The order of this listing corresponds to the order of numerals along the rows of the database.

Name                                    Data Type       Measurement         Description
  1. Cement (cement) quantitative kg in a m3 mixture Input Variable
  2. Blast Furnace Slag (slag) quantitative kg in a m3 mixture Input Variable
  3. Fly Ash (ash) quantitative kg in a m3 mixture Input Variable
  4. Water(water) quantitative kg in a m3 mixture Input Variable
  5. Superplasticizer (superplastic) quantitative kg in a m3 mixture Input Variable
  6. Coarse Aggregate (coarseagg) quantitative kg in a m3 mixture Input Variable
  7. Fine Aggregate (fineagg) quantitative kg in a m3 mixture Input Variable
  8. Age(age) quantitative Day (1~365) Input Variable
  9. Concrete compressive strength(strength) quantitative MPa Output Variable

Exploratory Data Quality Report Reflecting the Following:

Univariate analysis (10 marks)

Data types and description of the independent attributes which should include (name, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions/tails, missing values, outliers, duplicates (10 Marks)

Bi-variate analysis (10 marks)

between the predictor variables and target column. Comment on your findings in terms of their relationship and degree of relation if any. Visualize the analysis using boxplots and pair plots, histograms, or density curves.

Feature Engineering techniques (10 marks)

Identify opportunities (if any) to extract new features from existing features, drop a feature(if required) Hint: Feature Extraction, for example, consider a dataset with two features length and breadth. From this, we can extract a new feature Area which would be length * breadth. Get the data model ready and do a train test split. Decide on the complexity of the model, should it be a simple linear model in terms of parameters or would a quadratic or higher degree. (10 marks)

Creating the Model and Tuning It (30 marks)

Algorithms that you think will be suitable for this project. Use Kfold Cross-Validation to evaluate model performance. Use appropriate metrics and make a DataFrame to compare models w.r.t their metrics. (at least 3 algorithms, one bagging and one boosting based algorithms have to be there). (15 marks)

Techniques employed to squeeze that extra performance out of the model without making it overfit. Use Grid Search or Random Search on any of the two models used above. Make a DataFrame to compare models after hyperparameter tuning and their metrics as above. (15 marks)

Import all the header libraries

In [1]:
import os, sys, re
import numpy as np 
import pandas as pd
pd.options.display.float_format = '{:,.4f}'.format

# For Plot
import matplotlib.pyplot as plt 
import seaborn as sns
# Add nice background to the graphs
sns.set(color_codes=True)
# To enable plotting graphs in Jupyter notebook
%matplotlib inline

import pydotplus
import graphviz

# sklearn libraries
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import PolynomialFeatures

from sklearn.linear_model import LinearRegression, Lasso, Ridge, LogisticRegression

from sklearn.tree import export_graphviz
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import AdaBoostClassifier
from sklearn.ensemble import BaggingClassifier
from sklearn.ensemble import GradientBoostingClassifier

from sklearn.model_selection import KFold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import train_test_split, cross_val_score

# calculate confusion matrix and various performance metrics
from sklearn.metrics import confusion_matrix, recall_score, precision_score, \
    f1_score, accuracy_score, mean_squared_error, r2_score
#AUC ROC curve
from sklearn.metrics import roc_auc_score, roc_curve

# Additional libs
from yellowbrick.classifier import ClassificationReport, ROCAUC
from six import StringIO

from IPython.display import Image  
from IPython.core.display import display, HTML
display(HTML("<style>.container { width:95% !important; }</style>"))

Read the input file into pandas dataframe

In [2]:
# Let's read the data  into dataframe
csp_df = pd.read_csv('concrete.csv')
csp_df.head()
Out[2]:
cement slag ash water superplastic coarseagg fineagg age strength
0 141.3000 212.0000 0.0000 203.5000 0.0000 971.8000 748.5000 28 29.8900
1 168.9000 42.2000 124.3000 158.3000 10.8000 1,080.8000 796.2000 14 23.5100
2 250.0000 0.0000 95.7000 187.4000 5.5000 956.9000 861.2000 28 29.2200
3 266.0000 114.0000 0.0000 228.0000 0.0000 932.0000 670.0000 28 45.8500
4 154.8000 183.4000 0.0000 193.3000 9.1000 1,047.4000 696.7000 28 18.2900

1. Univariate analysis (10 marks)

Data types and description of the independent attributes which should include (name, range of values observed, central values (mean and median), standard deviation and quartiles, analysis of the body of distributions/tails, missing values, outliers, duplicates (10 Marks)

In [3]:
# print the shape of the dataframe
csp_df.shape
Out[3]:
(1030, 9)
In [4]:
# print the datatypes of each columns
csp_df.dtypes
Out[4]:
cement          float64
slag            float64
ash             float64
water           float64
superplastic    float64
coarseagg       float64
fineagg         float64
age               int64
strength        float64
dtype: object

Insights:

There are 9 columns and 1030 rows in the dataset. Looking at the output of the dtypes, observed that columns have float and int datatype. Observed the duplicate records and columns with zeros and I will treat/convert/drop them as needed in the following steps

Find missing values using info() and isnull() functions. Find the duplicates and remove them and print the descriptive statistics (min, max, mean, median, standard deviation and quartiles) of each & every column using describe() function

In [5]:
# List down total samples and weather they are null and the dtype of the dataframe
csp_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1030 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cement        1030 non-null   float64
 1   slag          1030 non-null   float64
 2   ash           1030 non-null   float64
 3   water         1030 non-null   float64
 4   superplastic  1030 non-null   float64
 5   coarseagg     1030 non-null   float64
 6   fineagg       1030 non-null   float64
 7   age           1030 non-null   int64  
 8   strength      1030 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 72.5 KB

Let's check the some of the individual attributes for their basic statistic such as central values, spread, tails etc.

In [6]:
# Check the unique values in each column of the dataframe.
csp_df.nunique()
Out[6]:
cement          278
slag            185
ash             156
water           195
superplastic    111
coarseagg       284
fineagg         302
age              14
strength        845
dtype: int64
In [7]:
# Let's find all the duplicate rows in the data frame
csp_df[csp_df.duplicated(keep=False)]
Out[7]:
cement slag ash water superplastic coarseagg fineagg age strength
27 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 91 65.2000
49 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 3 33.4000
88 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 3 35.3000
91 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 7 55.9000
96 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 28 71.3000
190 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 56 77.3000
245 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 91 79.3000
278 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 3 33.4000
298 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 3 33.4000
333 252.0000 0.0000 0.0000 185.0000 0.0000 1,111.0000 784.0000 28 19.6900
392 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 7 49.2000
400 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 3 35.3000
420 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 3 35.3000
433 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 28 60.2900
463 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 56 77.3000
468 252.0000 0.0000 0.0000 185.0000 0.0000 1,111.0000 784.0000 28 19.6900
482 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 91 65.2000
489 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 56 64.3000
493 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 91 79.3000
517 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 56 64.3000
525 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 28 71.3000
527 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 91 65.2000
576 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 7 55.9000
577 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 28 60.2900
604 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 56 77.3000
733 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 91 79.3000
738 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 28 71.3000
766 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 91 79.3000
830 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 7 49.2000
880 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 56 64.3000
884 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 7 49.2000
892 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 56 77.3000
933 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 7 55.9000
943 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 3 35.3000
967 362.6000 189.0000 0.0000 164.9000 11.6000 944.7000 755.8000 28 71.3000
992 425.0000 106.3000 0.0000 153.5000 16.5000 852.1000 887.1000 28 60.2900
In [8]:
# Let's drop all the duplicate rows from the dataframe
csp_df.drop_duplicates(inplace=True, keep="first")
In [9]:
csp_df.shape
Out[9]:
(1005, 9)
In [10]:
csp_df.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1005 entries, 0 to 1029
Data columns (total 9 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   cement        1005 non-null   float64
 1   slag          1005 non-null   float64
 2   ash           1005 non-null   float64
 3   water         1005 non-null   float64
 4   superplastic  1005 non-null   float64
 5   coarseagg     1005 non-null   float64
 6   fineagg       1005 non-null   float64
 7   age           1005 non-null   int64  
 8   strength      1005 non-null   float64
dtypes: float64(8), int64(1)
memory usage: 78.5 KB

NOTE: Dropped the duplicate records so shape is reduced to 1005 rows and 9 columns

In [11]:
csp_df.describe().T
Out[11]:
count mean std min 25% 50% 75% max
cement 1,005.0000 278.6313 104.3443 102.0000 190.7000 265.0000 349.0000 540.0000
slag 1,005.0000 72.0435 86.1708 0.0000 0.0000 20.0000 142.5000 359.4000
ash 1,005.0000 55.5363 64.2080 0.0000 0.0000 0.0000 118.3000 200.1000
water 1,005.0000 182.0753 21.3393 121.8000 166.6000 185.7000 192.9000 247.0000
superplastic 1,005.0000 6.0332 5.9200 0.0000 0.0000 6.1000 10.0000 32.2000
coarseagg 1,005.0000 974.3768 77.5797 801.0000 932.0000 968.0000 1,031.0000 1,145.0000
fineagg 1,005.0000 772.6883 80.3404 594.0000 724.3000 780.0000 822.2000 992.6000
age 1,005.0000 45.8567 63.7347 1.0000 7.0000 28.0000 56.0000 365.0000
strength 1,005.0000 35.2504 16.2848 2.3300 23.5200 33.8000 44.8700 82.6000
In [12]:
# Cross validate the Non-Null value reported by info()
csp_df.isnull().sum()
Out[12]:
cement          0
slag            0
ash             0
water           0
superplastic    0
coarseagg       0
fineagg         0
age             0
strength        0
dtype: int64
In [13]:
# Let's replace all the Zeros from the columns(slag, ash and superplastic) with mean value
# Make all 0 to np.nan , then fillna
csp_df=csp_df.mask(csp_df==0).fillna(csp_df.mean())
In [14]:
csp_df.describe().T
Out[14]:
count mean std min 25% 50% 75% max
cement 1,005.0000 278.6313 104.3443 102.0000 190.7000 265.0000 349.0000 540.0000
slag 1,005.0000 105.7355 62.1243 11.0000 72.0435 72.0435 142.5000 359.4000
ash 1,005.0000 85.4320 39.5736 24.5000 55.5363 55.5363 118.3000 200.1000
water 1,005.0000 182.0753 21.3393 121.8000 166.6000 185.7000 192.9000 247.0000
superplastic 1,005.0000 8.3025 4.0233 1.7000 6.0332 6.1000 10.0000 32.2000
coarseagg 1,005.0000 974.3768 77.5797 801.0000 932.0000 968.0000 1,031.0000 1,145.0000
fineagg 1,005.0000 772.6883 80.3404 594.0000 724.3000 780.0000 822.2000 992.6000
age 1,005.0000 45.8567 63.7347 1.0000 7.0000 28.0000 56.0000 365.0000
strength 1,005.0000 35.2504 16.2848 2.3300 23.5200 33.8000 44.8700 82.6000

Insights:

I have found few sample records with duplicate rows and dropped them. Looking at the output of the describe(), info(), and isnull(), it says that there are no columns with missing values. There were few columns with 0's and replaced them with the mean value of the column. I will revisit this section and treat the variables in the data preparation stage.

In [15]:
# List all the columns of the dataframe
csp_df.columns
Out[15]:
Index(['cement', 'slag', 'ash', 'water', 'superplastic', 'coarseagg',
       'fineagg', 'age', 'strength'],
      dtype='object')
In [16]:
# Let's visualize the individual data 
# and see how do they look. This helps in further decision making about the data
csp_df[csp_df.columns].hist(stacked=False, bins=50, figsize=(30,120), layout=(10,1));
2020-10-30T13:47:05.432282 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [17]:
# Let's implement the valueCount function
def valueCount(pd_df=None):
    columns = pd_df.columns
    for col in columns:
        print('value_counts for {}'.format(col))
        print(pd_df[col].value_counts(normalize=True).head(10))
        print()   
In [18]:
# Let's print the value count value of all the variables. Let's limit the rows to 10 for better visibility
valueCount(csp_df)
value_counts for cement
251.4000   0.0149
446.0000   0.0139
310.0000   0.0139
250.0000   0.0129
475.0000   0.0129
331.0000   0.0129
387.0000   0.0119
349.0000   0.0119
236.0000   0.0109
165.0000   0.0109
Name: cement, dtype: float64

value_counts for slag
72.0435    0.4677
189.0000   0.0159
24.0000    0.0139
20.0000    0.0119
145.0000   0.0109
98.1000    0.0100
19.0000    0.0100
106.3000   0.0100
26.0000    0.0080
22.0000    0.0080
Name: slag, dtype: float64

value_counts for ash
55.5363    0.5383
118.3000   0.0199
141.0000   0.0159
24.5000    0.0149
79.0000    0.0139
94.0000    0.0129
100.4000   0.0109
98.8000    0.0100
100.5000   0.0100
174.2000   0.0100
Name: ash, dtype: float64

value_counts for water
192.0000   0.1174
228.0000   0.0537
185.7000   0.0458
203.5000   0.0358
186.0000   0.0279
162.0000   0.0199
200.0000   0.0139
178.0000   0.0139
185.0000   0.0139
193.0000   0.0139
Name: water, dtype: float64

value_counts for superplastic
6.0332    0.3761
8.0000    0.0269
11.6000   0.0229
7.0000    0.0189
6.0000    0.0169
9.0000    0.0159
9.9000    0.0159
8.9000    0.0159
7.8000    0.0159
10.0000   0.0149
Name: superplastic, dtype: float64

value_counts for coarseagg
932.0000     0.0567
852.1000     0.0348
968.0000     0.0289
1,125.0000   0.0239
1,047.0000   0.0189
967.0000     0.0189
944.7000     0.0159
974.0000     0.0119
822.0000     0.0119
942.0000     0.0119
Name: coarseagg, dtype: float64

value_counts for fineagg
594.0000   0.0299
670.0000   0.0229
613.0000   0.0219
755.8000   0.0159
801.0000   0.0159
746.6000   0.0149
845.0000   0.0139
712.0000   0.0139
750.0000   0.0119
780.1000   0.0100
Name: fineagg, dtype: float64

value_counts for age
28    0.4169
3     0.1284
7     0.1214
56    0.0856
14    0.0617
90    0.0537
100   0.0517
180   0.0259
91    0.0169
365   0.0139
Name: age, dtype: float64

value_counts for strength
33.4000   0.0040
41.0500   0.0040
31.3500   0.0040
23.5200   0.0040
42.1300   0.0030
37.2700   0.0030
32.7200   0.0030
28.6300   0.0030
44.5200   0.0030
39.3000   0.0030
Name: strength, dtype: float64

Insights:

` Looking at the value_count() dump and the histogram plot, following observatios are made:

1). The data for the slag, ash and superplastic columns are highly scewed and most samples had value 0. Replaced zero with mean value

2). Age column is also scewed towards the lower number. ie. there are more samples with days value <=28. The most age samples value is 28. and they contibute ~41% of the total data samples

`

2. Bi-variate analysis (10 marks)

between the predictor variables and target column. Comment on your findings in terms of their relationship and degree of relation if any. Visualize the analysis using boxplots and pair plots, histograms, or density curves.

In [19]:
# NOTE: Disabling the pair plot as It takes lots of memory and compute power.
# Let's plot pair plots to see the relation between variables
plt.figure(figsize=(20,5))
sns.pairplot(csp_df)
plt.show()
<Figure size 1440x360 with 0 Axes>
2020-10-30T13:47:34.825928 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [20]:
# Let's plot the box plot between age and strength.
# Set the plot window size
plt.figure(figsize=(20,5))
sns.boxplot(csp_df['age'], csp_df['strength']);
plt.show()
2020-10-30T13:47:48.205227 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [21]:
# Let's plot the box plot between superplastic and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(csp_df['superplastic'], csp_df['strength']);

plt.xticks(
    rotation=90, 
    horizontalalignment='center',
    fontweight='normal',
    fontsize='large'  
)

plt.show()
2020-10-30T13:47:52.289525 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [22]:
# Let's plot the box plot between fineagg and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(csp_df['fineagg'], csp_df['strength']);

plt.xticks(
    rotation=90, 
    horizontalalignment='center',
    fontweight='normal',
    fontsize='large'  
)

plt.show()
2020-10-30T13:48:04.641082 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [23]:
# Let's plot the box plot between coarseagg and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(csp_df['coarseagg'], csp_df['strength']);

plt.xticks(
    rotation=90, 
    horizontalalignment='center',
    fontweight='normal',
    fontsize='large'  
)
plt.show()
2020-10-30T13:48:23.484486 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [24]:
# Let's plot the box plot between ash and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(csp_df['ash'], csp_df['strength']);

plt.xticks(
    rotation=90, 
    horizontalalignment='center',
    fontweight='normal',
    fontsize='large'  
)
plt.show()
2020-10-30T13:48:38.143264 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [26]:
# Let's plot the box plot between slag and strength.
# Set the plot window size
plt.figure(figsize=(30,5))
sns.boxplot(csp_df['slag'], csp_df['strength']);

plt.xticks(
    rotation=90, 
    horizontalalignment='center',
    fontweight='normal',
    fontsize='large'  
)
plt.show()
2020-10-30T13:49:00.752493 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [27]:
# Let's plot Strength vs Cement, Water and age
plt.figure(figsize=(30,10))
sns.scatterplot(y="strength", x="cement", hue="water",size="age", data=csp_df, sizes=(50, 300))
Out[27]:
<AxesSubplot:xlabel='cement', ylabel='strength'>
2020-10-30T13:49:07.571995 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [28]:
# Let's plot Strength vs Fineagg, Ash and Superplastic
plt.figure(figsize=(30,10))
sns.scatterplot(y="strength", x="fineagg", hue="ash", size="superplastic", data=csp_df, sizes=(50, 300))
Out[28]:
<AxesSubplot:xlabel='fineagg', ylabel='strength'>
2020-10-30T13:49:08.826688 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [29]:
# Let's see the corelation of the variables
corr = csp_df.corr()
sns.heatmap(corr, annot=True, cmap='Blues')
Out[29]:
<AxesSubplot:>
2020-10-30T13:49:10.240694 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
In [30]:
# Let's see the customer's various account relationship for strength
csp_df.groupby('strength').mean() 
Out[30]:
cement slag ash water superplastic coarseagg fineagg age
strength
2.3300 108.3000 162.4000 55.5363 203.5000 6.0332 938.2000 849.0000 3.0000
3.3200 122.6000 183.9000 55.5363 203.5000 6.0332 958.2000 800.1000 3.0000
4.5700 102.0000 153.0000 55.5363 192.0000 6.0332 887.0000 942.0000 3.0000
4.7800 153.0000 102.0000 55.5363 192.0000 6.0332 888.0000 943.1000 3.0000
4.8300 141.3000 212.0000 55.5363 203.5000 6.0332 971.8000 748.5000 3.0000
... ... ... ... ... ... ... ... ...
79.4000 389.9000 189.0000 55.5363 145.9000 22.0000 944.7000 755.8000 56.0000
79.9900 540.0000 72.0435 55.5363 162.0000 2.5000 1,040.0000 676.0000 28.0000
80.2000 323.7000 282.8000 55.5363 183.8000 10.3000 942.7000 659.9000 56.0000
81.7500 315.0000 137.0000 55.5363 145.0000 5.9000 1,130.0000 745.0000 28.0000
82.6000 389.9000 189.0000 55.5363 145.9000 22.0000 944.7000 755.8000 91.0000

845 rows × 8 columns

In [31]:
# pd.crosstab(csp_df['strength'],csp_df['ash'],normalize='index')

Insights:

`

Looking at the bivariat pair plots and corelation plots, following observations are made:

1). Observe a high positive correlation between Strength and Cement. Concrete strength indeed increases with an increase in the amount of cement used in preparing it.

2). Age and Superplastic are the other two factors influencing Compressive strength. There are other strong correlations between the features: A strong negative correlation between Superplastic and Water. Found positive correlations between Superplastic and Ash. Found positive correlation between Superplastic Fine Aggregate.

3). Compressive strength increases as the amount of cement increases

4). Compressive strength increases with age

5). Concrete strength increases when less water is used in preparing it

6). Compressive strength decreases with ash increases

7). Compressive strength increases with Superplasticizer

8). Compressive strength increases with slag

`

Feature Engineering techniques (10 marks)

Identify opportunities (if any) to extract new features from existing features, drop a feature(if required) Hint: Feature Extraction, for example, consider a dataset with two features length and breadth. From this, we can extract a new feature Area which would be length * breadth. Get the data model ready and do a train test split. Decide on the complexity of the model, should it be a simple linear model in terms of parameters or would a quadratic or higher degree. (10 marks)

In [32]:
# Looking at the the data they are in different scale. So let's bring all the variable in the same scale using standard scaler
ss_scaler = StandardScaler()

# Let's separate the dependent and independent variables into X and Y variables
csp_df_Y = csp_df[['strength']]
csp_df_X = csp_df.drop('strength', axis=1)

# Let's apply Binning to output variable "strength"
# Let's replace numrical values to low, mid and high
csp_df_Y['strength'] = pd.cut(csp_df_Y['strength'], bins=[0,27,54,90], labels=["low", "mid", "high"])
# print(csp_df_Y['strength'])

# Let's convert string to numerical values
csp_df_Y = csp_df_Y.replace({"strength":{"low":0, "mid":1, "high":1}})

# Fit and transform the X variables
csp_df_X_scaled = pd.DataFrame(ss_scaler.fit_transform(csp_df_X), columns=csp_df_X.columns)
In [33]:
# Let's split the data into training and test set in the ratio of 70:30 respectively
X_train, X_test, y_train, y_test = train_test_split(csp_df_X_scaled, csp_df_Y, test_size=0.30, random_state=11)
In [34]:
X_train.shape
Out[34]:
(703, 8)
In [35]:
X_test.shape
Out[35]:
(302, 8)
In [36]:
y_train.shape
Out[36]:
(703, 1)
In [37]:
y_test.shape
Out[37]:
(302, 1)
In [38]:
print(X_train.head())
print()
print(y_train.head())
     cement    slag     ash   water  superplastic  coarseagg  fineagg     age
928  0.5165  0.5921 -0.7558  2.1532       -0.5643    -0.5465  -2.2252  3.5186
6   -1.0723  2.3266 -0.7558  1.0045       -0.5643     0.0158  -0.9974 -0.6100
663 -0.3886 -0.5426  0.2191  0.2168       -0.3239    -0.3157   0.9254  0.1592
158 -1.5594  1.0833 -0.7558  0.4653       -0.5643    -0.8328   1.4846 -0.2803
513  1.6048 -1.3163 -0.1626 -0.9412        0.8200    -0.0951  -0.7558 -0.2803

     strength
951         1
6           0
677         1
158         0
522         1
In [39]:
print(X_test.head())
print()
print(y_test.head())
     cement    slag     ash  water  superplastic  coarseagg  fineagg     age
413  0.5213 -0.5426 -0.7558 0.4653       -0.5643    -0.5568   0.8706  0.6930
712 -0.6389  0.5728 -0.7558 1.0045       -0.5643    -0.0126  -0.2825 -0.2803
265  0.5213 -0.5426 -0.7558 0.4653       -0.5643    -0.5568   0.8706 -0.6728
272  0.6747 -0.5426 -0.7558 0.4653       -0.5643     1.0526   0.4522  1.1639
894  1.0199 -0.5426 -0.7558 0.1840       -0.5643    -0.1080  -0.1207 -0.6100

     strength
416         1
726         1
265         0
272         1
915         0

we can fit different models on the training data and compare their performance. As this is a regression problem, for primary model, I will use linear regression model and will apply Lasso and Reidge algorithms. I will use RMSE (Root Mean Square Error) and R² score as evaluation metrics. I will apply poly regression models and will finalize the algorithm with good performance. Later, I will use more complex algorithms such as Logistic Regression,decision tree, RamdomForest, bragging and boosting algorithms and compare the performance

In [40]:
# Linear Regression 
lr = LinearRegression()
# Fitting models on Training data 
lr.fit(X_train, y_train)
print(lr.score(X_train, y_train))
print()
print()
# Making predictions on Test data 
y_pred_lr = lr.predict(X_test) 
# Compute rmse and R2 
rmse_lr = np.sqrt(mean_squared_error(y_test, y_pred_lr))
R2_score_lr = r2_score(y_test, y_pred_lr)
print("Linear Regression \t RMSE = {:.2f} \t R2_score = {:.2f}".format(rmse_lr, R2_score_lr))

print()
# Let's apply Lasso Regression Algorithm
lasso = Lasso() 
# Fitting models on Training data 
lasso.fit(X_train, y_train) 
# Making predictions on Test data 
y_pred_lasso = lasso.predict(X_test) 
# Compute rmse and R2 
rmse_lasso = np.sqrt(mean_squared_error(y_test, y_pred_lr))
R2_score_lasso = r2_score(y_test, y_pred_lasso)
print("Lasso Regression \t RMSE = {:.2f} \t R2_score = {:.2f}".format(rmse_lasso, R2_score_lasso))

# Let's apply Ridge Regression Algorithm
print()
# Let's apply Lasso regression algorithm
ridge = Ridge()  
# Fitting models on Training data 
ridge.fit(X_train, y_train) 
# Making predictions on Test data 
y_pred_ridge = ridge.predict(X_test) 
# Compute rmse and R2 
rmse_ridge = np.sqrt(mean_squared_error(y_test, y_pred_lr))
R2_score_ridge = r2_score(y_test, y_pred_lr)
print("Ridge Regression \t RMSE = {:.2f} \t R2_score = {:.2f}".format(rmse_ridge, R2_score_ridge))
0.321898566500731


Linear Regression 	 RMSE = 0.38 	 R2_score = 0.35

Lasso Regression 	 RMSE = 0.38 	 R2_score = -0.00

Ridge Regression 	 RMSE = 0.38 	 R2_score = 0.35

NOTE: Looking at the above results, Linear and Ridge regression shows same behaviour. Lasso algorithm has same RMES error but the R2_score is very bad. Let's work on poly regression model

In [41]:
# Linear Regression 
lr = LinearRegression()
lr.fit(X_train, y_train)

# PolynomialFeatures 
poly = PolynomialFeatures(degree=3, include_bias=False)
poly_X_train = poly.fit_transform(X_train)
# print(poly_X_train)

# Fitting models on Training data 
poly_regress_model = lr.fit(poly_X_train, y_train)
print(poly_regress_model.score(poly_X_train, y_train))
0.7166303644377656

When applied polynomial feature with degree=3, the socre has improved from 0.3218 to 0.7166

Creating the Model and Tuning It (30 marks)

Algorithms that you think will be suitable for this project. Use Kfold Cross-Validation to evaluate model performance. Use appropriate metrics and make a DataFrame to compare models w.r.t their metrics. (at least 3 algorithms, one bagging and one boosting based algorithms have to be there). (15 marks)

Let's Implement models with Kfold cross-validation with various models

Let's implement the common functions used for the all the models

In [42]:
# Let's create an empty dictionary to store model name and performance matrix
m_perf_matrix = dict()
In [43]:
## Create a function to get confusion matrix in a proper format
def pltConfusionMatrix(actual, predicted):
    cm = confusion_matrix(actual, predicted)
    sns.heatmap(cm, annot=True,  fmt='.4f', xticklabels=[0,1], yticklabels=[0,1])
    plt.ylabel('Observed')
    plt.xlabel('Predicted')
    plt.show()
In [44]:
# Compute performance matrix
def computePerfMatrix(m_classifier, m_predict, kfold=None):
    # Let's store all the perf variables in the dict.
    perf_matrix = dict()

    perf_matrix['training_accuracy'] = m_classifier.score(X_train, y_train)
    print("Trainig dataset accuracy:", m_classifier.score(X_train, y_train))
    print()
    perf_matrix['test_accuracy'] = m_classifier.score(X_test, y_test)
    print("Testing dataset accuracy:", m_classifier.score(X_test, y_test))
    print()
    print('Confusion Matrix:')
    print(pltConfusionMatrix(y_test, m_predict))
    print()
    perf_matrix['recall_score'] = recall_score(y_test, m_predict, average="micro")
    print("Recall:",recall_score(y_test, m_predict, average="micro"))
    print()
    perf_matrix['precision_score'] = precision_score(y_test, m_predict, average="micro")
    print("Precision:",precision_score(y_test, m_predict, average="micro"))
    print()
    perf_matrix['f1_score'] = f1_score(y_test, m_predict, average="micro")
    print("F1 Score:", f1_score(y_test, m_predict, average="micro"))
    print()
    perf_matrix['roc_auc_score'] = roc_auc_score(y_test, m_predict, average="micro")
    print("Roc Auc Score:", roc_auc_score(y_test, m_predict, average="micro"))
    print()
    if (kfold is not None):
        perf_matrix['cross_val_score'] = cross_val_score(m_classifier, csp_df_X, csp_df_Y, cv=kfold, scoring='roc_auc').mean()
        print("cross_val_score:", cross_val_score(m_classifier, csp_df_X, csp_df_Y, cv=kfold, scoring='roc_auc').mean())
    else:
        perf_matrix['cross_val_score'] = cross_val_score(m_classifier, csp_df_X, csp_df_Y, cv=10, scoring='roc_auc').mean()
        print("cross_val_score:", cross_val_score(m_classifier, csp_df_X, csp_df_Y, cv=10, scoring='roc_auc').mean())
    # Return the model specific resultssss
    return perf_matrix
In [45]:
# Visualize the performance matrix
# # Visualize model's performance with yellowbrick library
def visualizePerfMatrix(m_classifier):
    viz = ClassificationReport(m_classifier)
    viz.fit(X_train, y_train)
    viz.score(X_test, y_test)
    viz.show()

    roc = ROCAUC(m_classifier)
    roc.fit(X_train, y_train)
    roc.score(X_test, y_test)
    roc.show()

Logistic Regression model

In [46]:
# let's make Logistic Regression model
# 
lr = LogisticRegression(random_state=11, penalty='l1', solver='liblinear', class_weight=None, C=1.0)
lr.fit(X_train, y_train)
lr_predict = lr.predict(X_test)

# Let's compute the kfold
k_fold = KFold(n_splits=10, random_state=11)

# # Compute the various perf matrix
perf_matrix_d = computePerfMatrix(lr, lr_predict, kfold=k_fold)
m_perf_matrix["LogisticRegression"] = perf_matrix_d
Trainig dataset accuracy: 0.8520625889046942

Testing dataset accuracy: 0.8609271523178808

Confusion Matrix:
2020-10-30T13:50:47.176706 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
None

Recall: 0.8609271523178808

Precision: 0.8609271523178808

F1 Score: 0.8609271523178808

Roc Auc Score: 0.8452060069310744

cross_val_score: 0.9229199397372276
In [47]:
# Visualize model's performance with yellowbrick library
visualizePerfMatrix(lr)
2020-10-30T13:50:52.433589 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
2020-10-30T13:50:52.780456 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/

Decision Tree model

In [48]:
# Decision tree model with criterion=gini which is default and with random_state=11. Let's change the max_depth=4.
 
dtc = DecisionTreeClassifier(criterion='gini', random_state=11, max_depth=4)
dtc.fit(X_train, y_train)
# Predict the performance of test samples
dtc_predict = dtc.predict(X_test)

# Let's compute the kfold
k_fold = KFold(n_splits=10, random_state=11)

# # Compute the various perf matrix
perf_matrix_d = computePerfMatrix(dtc, dtc_predict, kfold=k_fold)
m_perf_matrix["DecisionTree"] = perf_matrix_d
Trainig dataset accuracy: 0.8890469416785206

Testing dataset accuracy: 0.8642384105960265

Confusion Matrix:
2020-10-30T13:50:57.203294 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
None

Recall: 0.8642384105960265

Precision: 0.8642384105960265

F1 Score: 0.8642384105960265

Roc Auc Score: 0.8195995379283788

cross_val_score: 0.8646433641137372
In [49]:
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(dtc)
2020-10-30T13:51:01.571298 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
2020-10-30T13:51:01.939534 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/

Random Forest classifier Model

In [50]:
# Let's build RandomForestClassification model
rfc = RandomForestClassifier(n_estimators=50, criterion='gini', random_state=11)
rfc = rfc.fit(X_train, y_train)
# Predict the performance of the test samples
rfc_predict = rfc.predict(X_test)

# Let's compute the kfold
k_fold = KFold(n_splits=10, random_state=11)
In [51]:
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(rfc, rfc_predict, kfold=k_fold)
m_perf_matrix["RandomForest"] = perf_matrix_d
Trainig dataset accuracy: 0.9971550497866287

Testing dataset accuracy: 0.9370860927152318

Confusion Matrix:
2020-10-30T13:51:14.388521 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
None

Recall: 0.9370860927152318

Precision: 0.9370860927152318

F1 Score: 0.9370860927152318

Roc Auc Score: 0.9233731228340393

cross_val_score: 0.9765778862471493
In [52]:
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(rfc)
2020-10-30T13:51:20.773033 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
2020-10-30T13:51:21.131334 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/

Adaboost Classifier Model

In [53]:
# Let's implement the adaboost emsemble classifier algorithm
abc = AdaBoostClassifier(n_estimators=50, learning_rate=0.1, random_state=11)
abc = abc.fit(X_train, y_train)
# Predict the performance of the tests samples
abc_predict = abc.predict(X_test)

# Let's compute the kfold
k_fold = KFold(n_splits=10, random_state=11)
In [54]:
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(abc, abc_predict, kfold=k_fold)
m_perf_matrix["AdaBoost"] = perf_matrix_d
Trainig dataset accuracy: 0.8520625889046942

Testing dataset accuracy: 0.8675496688741722

Confusion Matrix:
2020-10-30T13:51:31.210384 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
None

Recall: 0.8675496688741722

Precision: 0.8675496688741722

F1 Score: 0.8675496688741722

Roc Auc Score: 0.8134867154408933

cross_val_score: 0.9274864927168069
In [55]:
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(abc)
2020-10-30T13:51:35.867684 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
2020-10-30T13:51:36.243034 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/

Bagging Classifier model

In [56]:
# Let's build bagging model
bgc = BaggingClassifier(n_estimators=50, max_samples=0.7, random_state=11, bootstrap=True, oob_score=True)
bgc = bgc.fit(X_train, y_train)
# Predict the performance of the tests samples
bgc_predict = bgc.predict(X_test)

# Let's compute the kfold
k_fold = KFold(n_splits=10, random_state=11)
In [57]:
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(bgc, bgc_predict, kfold=k_fold)
m_perf_matrix["Bagging"] = perf_matrix_d
Trainig dataset accuracy: 0.9914651493598862

Testing dataset accuracy: 0.9304635761589404

Confusion Matrix:
2020-10-30T13:51:44.756475 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
None

Recall: 0.9304635761589404

Precision: 0.9304635761589404

F1 Score: 0.9304635761589404

Roc Auc Score: 0.9139391605698882

cross_val_score: 0.9711233449514607
In [58]:
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(bgc)
2020-10-30T13:51:52.473027 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
2020-10-30T13:51:52.867341 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/

Gradient Boosting Classifier model

In [59]:
gbc = GradientBoostingClassifier(n_estimators=50, learning_rate = 0.1, random_state=11, max_depth=4)
gbc = gbc.fit(X_train, y_train)
# Predict the performance of the tests samples
gbc_predict = gbc.predict(X_test)

# Let's compute the kfold
k_fold = KFold(n_splits=10, random_state=11)
In [60]:
# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(gbc, gbc_predict, kfold=k_fold)
m_perf_matrix["GradientBoosting"] = perf_matrix_d
Trainig dataset accuracy: 0.984352773826458

Testing dataset accuracy: 0.9139072847682119

Confusion Matrix:
2020-10-30T13:51:59.462160 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
None

Recall: 0.9139072847682119

Precision: 0.9139072847682119

F1 Score: 0.9139072847682119

Roc Auc Score: 0.8990180978051598

cross_val_score: 0.9725380696823087
In [61]:
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(gbc)
2020-10-30T13:52:03.742958 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
2020-10-30T13:52:04.199474 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/

Conclusion Summary of the performance of KFold cross validations

In [62]:
print(m_perf_matrix)
{'LogisticRegression': {'training_accuracy': 0.8520625889046942, 'test_accuracy': 0.8609271523178808, 'recall_score': 0.8609271523178808, 'precision_score': 0.8609271523178808, 'f1_score': 0.8609271523178808, 'roc_auc_score': 0.8452060069310744, 'cross_val_score': 0.9229199397372276}, 'DecisionTree': {'training_accuracy': 0.8890469416785206, 'test_accuracy': 0.8642384105960265, 'recall_score': 0.8642384105960265, 'precision_score': 0.8642384105960265, 'f1_score': 0.8642384105960265, 'roc_auc_score': 0.8195995379283788, 'cross_val_score': 0.8646433641137372}, 'RandomForest': {'training_accuracy': 0.9971550497866287, 'test_accuracy': 0.9370860927152318, 'recall_score': 0.9370860927152318, 'precision_score': 0.9370860927152318, 'f1_score': 0.9370860927152318, 'roc_auc_score': 0.9233731228340393, 'cross_val_score': 0.9765778862471493}, 'AdaBoost': {'training_accuracy': 0.8520625889046942, 'test_accuracy': 0.8675496688741722, 'recall_score': 0.8675496688741722, 'precision_score': 0.8675496688741722, 'f1_score': 0.8675496688741722, 'roc_auc_score': 0.8134867154408933, 'cross_val_score': 0.9274864927168069}, 'Bagging': {'training_accuracy': 0.9914651493598862, 'test_accuracy': 0.9304635761589404, 'recall_score': 0.9304635761589404, 'precision_score': 0.9304635761589404, 'f1_score': 0.9304635761589404, 'roc_auc_score': 0.9139391605698882, 'cross_val_score': 0.9711233449514607}, 'GradientBoosting': {'training_accuracy': 0.984352773826458, 'test_accuracy': 0.9139072847682119, 'recall_score': 0.9139072847682119, 'precision_score': 0.9139072847682119, 'f1_score': 0.9139072847682119, 'roc_auc_score': 0.8990180978051598, 'cross_val_score': 0.9725380696823087}}
In [63]:
# Change dataframe precision to 4 points
pd.options.display.float_format = '{:,.4f}'.format

# Conver the dictionary to dataframe
kfold_summary_df = pd.DataFrame.from_dict(m_perf_matrix,orient='index').T
kfold_summary_df
Out[63]:
LogisticRegression DecisionTree RandomForest AdaBoost Bagging GradientBoosting
training_accuracy 0.8521 0.8890 0.9972 0.8521 0.9915 0.9844
test_accuracy 0.8609 0.8642 0.9371 0.8675 0.9305 0.9139
recall_score 0.8609 0.8642 0.9371 0.8675 0.9305 0.9139
precision_score 0.8609 0.8642 0.9371 0.8675 0.9305 0.9139
f1_score 0.8609 0.8642 0.9371 0.8675 0.9305 0.9139
roc_auc_score 0.8452 0.8196 0.9234 0.8135 0.9139 0.8990
cross_val_score 0.9229 0.8646 0.9766 0.9275 0.9711 0.9725

Noted following from above table:

RandomForest and Bagging algorithm are the the best overall algorithms.

Let's implement Random Search on all the above algorithms and compare them

Techniques employed to squeeze that extra performance out of the model without making it overfit. Use Grid Search or Random Search on any of the two models used above. Make a DataFrame to compare models after hyperparameter tuning and their metrics as above. (15 marks)

In [64]:
# Let's create an empty dictionary to store model name and performance matrix
random_search_perf_matrix = dict()
In [65]:
# Specify parameters and distributions to sample from
param_dist = {
    "penalty" : ['l1', 'l2'],
    "C": [0.1, 0.15, 0.50, 0.75, 0.90, 1.0],
    "solver" : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    "dual" : [True, False],
    "fit_intercept" : [True, False]

}

# Number of random samples
samples = 11  

random_search_lr = RandomizedSearchCV(lr, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold

random_search_lr.fit(csp_df_X, csp_df_Y)

print(random_search_lr.best_params_)

random_search_perf_matrix["random_search_lr"] = random_search_lr.best_params_
{'solver': 'newton-cg', 'penalty': 'l2', 'fit_intercept': False, 'dual': False, 'C': 0.5}
In [66]:
# Specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": [1, 3, 8],
              "min_samples_split": [2, 3, 8],
              "min_samples_leaf": [1, 3, 8],
              #"bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Number of random samples
samples = 11  

random_search_dtc = RandomizedSearchCV(dtc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold

random_search_dtc.fit(csp_df_X, csp_df_Y)

print(random_search_dtc.best_params_)

random_search_perf_matrix["random_search_dtc"] = random_search_dtc.best_params_
{'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 3, 'max_depth': None, 'criterion': 'entropy'}
In [67]:
# Specify parameters and distributions to sample from
param_dist = {"max_depth": [3, None],
              "max_features": [1, 3, 8],
              "min_samples_split": [2, 3, 8],
              "min_samples_leaf": [1, 3, 8],
              "bootstrap": [True, False],
              "criterion": ["gini", "entropy"]}

# Number of random samples
samples = 11  

random_search_rfc = RandomizedSearchCV(rfc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold

random_search_rfc.fit(csp_df_X, csp_df_Y)

print(random_search_rfc.best_params_)

random_search_perf_matrix["random_search_rfc"] = random_search_rfc.best_params_
{'min_samples_split': 8, 'min_samples_leaf': 1, 'max_features': 8, 'max_depth': None, 'criterion': 'gini', 'bootstrap': True}
In [68]:
# Specify parameters and distributions to sample from
param_dist = {"algorithm" : ['SAMME', 'SAMME.R'],
              "learning_rate": [0.1, 0.5, 0.75, 1.0],
              }

# Number of random samples
samples = 11  

random_search_abc = RandomizedSearchCV(abc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold

random_search_abc.fit(csp_df_X, csp_df_Y)

print(random_search_abc.best_params_)

random_search_perf_matrix["random_search_abc"] = random_search_abc.best_params_
{'learning_rate': 0.75, 'algorithm': 'SAMME.R'}
In [69]:
# Specify parameters and distributions to sample from
param_dist = {"max_features": [1, 3, 8],
              "bootstrap": [True, False],
              "bootstrap_features" : [True, False],
              "oob_score" : [True, False],
              }

# Number of random samples
samples = 11  

random_search_bgc = RandomizedSearchCV(bgc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold

random_search_bgc.fit(csp_df_X, csp_df_Y)

print(random_search_bgc.best_params_)

random_search_perf_matrix["random_search_bgc"] = random_search_bgc.best_params_
{'oob_score': False, 'max_features': 8, 'bootstrap_features': True, 'bootstrap': False}
In [70]:
# Specify parameters and distributions to sample from
param_dist = {"max_features": [1, 3, 8],
              "max_features": [1, 3, 8],
              "min_samples_split": [2, 3, 8],
              "min_samples_leaf": [1, 3, 8],
              "learning_rate" : [0.1, 0.5, 0.75, 1.0],
              "loss" : ['deviance', 'exponential'],
              "criterion" : ['friedman_mse', 'mse', 'mae']
              }

# Number of random samples
samples = 11  

random_search_gbc = RandomizedSearchCV(gbc, param_distributions=param_dist, n_iter=samples, cv=k_fold) #default cv = k_fold

random_search_gbc.fit(csp_df_X, csp_df_Y)

print(random_search_gbc.best_params_)

random_search_perf_matrix["random_search_gbc"] = random_search_gbc.best_params_
{'min_samples_split': 8, 'min_samples_leaf': 8, 'max_features': 3, 'loss': 'exponential', 'learning_rate': 0.5, 'criterion': 'friedman_mse'}

Conclusion Summary of the best parameters of Random Search CV with various Algorithms

In [71]:
print (random_search_perf_matrix)
{'random_search_lr': {'solver': 'newton-cg', 'penalty': 'l2', 'fit_intercept': False, 'dual': False, 'C': 0.5}, 'random_search_dtc': {'min_samples_split': 3, 'min_samples_leaf': 1, 'max_features': 3, 'max_depth': None, 'criterion': 'entropy'}, 'random_search_rfc': {'min_samples_split': 8, 'min_samples_leaf': 1, 'max_features': 8, 'max_depth': None, 'criterion': 'gini', 'bootstrap': True}, 'random_search_abc': {'learning_rate': 0.75, 'algorithm': 'SAMME.R'}, 'random_search_bgc': {'oob_score': False, 'max_features': 8, 'bootstrap_features': True, 'bootstrap': False}, 'random_search_gbc': {'min_samples_split': 8, 'min_samples_leaf': 8, 'max_features': 3, 'loss': 'exponential', 'learning_rate': 0.5, 'criterion': 'friedman_mse'}}
In [72]:
# Change dataframe precision to 4 points
pd.options.display.float_format = '{:,.4f}'.format

# Conver the dictionary to dataframe
random_search_summary_df = pd.DataFrame.from_dict(random_search_perf_matrix,orient='index').T
random_search_summary_df
Out[72]:
random_search_lr random_search_dtc random_search_rfc random_search_gbc random_search_bgc random_search_abc
solver newton-cg NaN NaN NaN NaN NaN
penalty l2 NaN NaN NaN NaN NaN
fit_intercept False NaN NaN NaN NaN NaN
dual False NaN NaN NaN NaN NaN
C 0.5000 NaN NaN NaN NaN NaN
min_samples_split NaN 3.0000 8.0000 8.0000 NaN NaN
min_samples_leaf NaN 1.0000 1.0000 8.0000 NaN NaN
max_features NaN 3.0000 8.0000 3.0000 8.0000 NaN
max_depth NaN NaN NaN NaN NaN NaN
criterion NaN entropy gini friedman_mse NaN NaN
bootstrap NaN NaN True NaN False NaN
learning_rate NaN NaN NaN 0.5000 NaN 0.7500
algorithm NaN NaN NaN NaN NaN SAMME.R
oob_score NaN NaN NaN NaN False NaN
bootstrap_features NaN NaN NaN NaN True NaN
loss NaN NaN NaN exponential NaN NaN

`Let's compare final Randomforest and Bragging with above hyper tuned parameters and decide which model is the best

Final Random Forest Model

In [73]:
# Let's build RandomForestClassification model
rfc = RandomForestClassifier(n_estimators=50, criterion='entropy', random_state=11, min_samples_split=3, min_samples_leaf=1, max_features=3, max_depth=None,bootstrap=False)
rfc = rfc.fit(X_train, y_train)
# Predict the performance of the test samples
rfc_predict = rfc.predict(X_test)

# Let's compute the kfold
k_fold = KFold(n_splits=10, random_state=11)

# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(rfc, rfc_predict, kfold=k_fold)
m_perf_matrix["RandomForest"] = perf_matrix_d
Trainig dataset accuracy: 0.9971550497866287

Testing dataset accuracy: 0.9337748344370861

Confusion Matrix:
2020-10-30T13:53:56.473973 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
None

Recall: 0.9337748344370861

Precision: 0.9337748344370861

F1 Score: 0.9337748344370861

Roc Auc Score: 0.9229880631497882

cross_val_score: 0.9743164096541065
In [74]:
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(rfc)
2020-10-30T13:53:59.375797 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
2020-10-30T13:53:59.790932 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/

Final Bagging Model

In [75]:
# Let's build bagging model
bgc = BaggingClassifier(n_estimators=50, max_samples=0.7, random_state=11, oob_score=False, max_features=8, bootstrap_features=True, bootstrap=False)
bgc = bgc.fit(X_train, y_train)
# Predict the performance of the tests samples
bgc_predict = bgc.predict(X_test)

# Let's compute the kfold
k_fold = KFold(n_splits=10, random_state=11)

# Let's compute all the performance related parameters
perf_matrix_d = computePerfMatrix(bgc, bgc_predict, kfold=k_fold)
m_perf_matrix["Bagging"] = perf_matrix_d
Trainig dataset accuracy: 0.9971550497866287

Testing dataset accuracy: 0.9304635761589404

Confusion Matrix:
2020-10-30T13:54:19.340738 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
None

Recall: 0.9304635761589404

Precision: 0.9304635761589404

F1 Score: 0.9304635761589404

Roc Auc Score: 0.9117731998459762

cross_val_score: 0.9704103683415608
In [76]:
# # Visualize model's performance with yellowbrick library
visualizePerfMatrix(bgc)
2020-10-30T13:54:28.255903 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/
2020-10-30T13:54:28.604958 image/svg+xml Matplotlib v3.3.0, https://matplotlib.org/

Final Summary:

In [77]:
print(m_perf_matrix)
{'LogisticRegression': {'training_accuracy': 0.8520625889046942, 'test_accuracy': 0.8609271523178808, 'recall_score': 0.8609271523178808, 'precision_score': 0.8609271523178808, 'f1_score': 0.8609271523178808, 'roc_auc_score': 0.8452060069310744, 'cross_val_score': 0.9229199397372276}, 'DecisionTree': {'training_accuracy': 0.8890469416785206, 'test_accuracy': 0.8642384105960265, 'recall_score': 0.8642384105960265, 'precision_score': 0.8642384105960265, 'f1_score': 0.8642384105960265, 'roc_auc_score': 0.8195995379283788, 'cross_val_score': 0.8646433641137372}, 'RandomForest': {'training_accuracy': 0.9971550497866287, 'test_accuracy': 0.9337748344370861, 'recall_score': 0.9337748344370861, 'precision_score': 0.9337748344370861, 'f1_score': 0.9337748344370861, 'roc_auc_score': 0.9229880631497882, 'cross_val_score': 0.9743164096541065}, 'AdaBoost': {'training_accuracy': 0.8520625889046942, 'test_accuracy': 0.8675496688741722, 'recall_score': 0.8675496688741722, 'precision_score': 0.8675496688741722, 'f1_score': 0.8675496688741722, 'roc_auc_score': 0.8134867154408933, 'cross_val_score': 0.9274864927168069}, 'Bagging': {'training_accuracy': 0.9971550497866287, 'test_accuracy': 0.9304635761589404, 'recall_score': 0.9304635761589404, 'precision_score': 0.9304635761589404, 'f1_score': 0.9304635761589404, 'roc_auc_score': 0.9117731998459762, 'cross_val_score': 0.9704103683415608}, 'GradientBoosting': {'training_accuracy': 0.984352773826458, 'test_accuracy': 0.9139072847682119, 'recall_score': 0.9139072847682119, 'precision_score': 0.9139072847682119, 'f1_score': 0.9139072847682119, 'roc_auc_score': 0.8990180978051598, 'cross_val_score': 0.9725380696823087}}
In [78]:
# Change dataframe precision to 4 points
pd.options.display.float_format = '{:,.4f}'.format

# Conver the dictionary to dataframe
kfold_summary_df = pd.DataFrame.from_dict(m_perf_matrix,orient='index').T
kfold_summary_df
Out[78]:
LogisticRegression DecisionTree RandomForest AdaBoost Bagging GradientBoosting
training_accuracy 0.8521 0.8890 0.9972 0.8521 0.9972 0.9844
test_accuracy 0.8609 0.8642 0.9338 0.8675 0.9305 0.9139
recall_score 0.8609 0.8642 0.9338 0.8675 0.9305 0.9139
precision_score 0.8609 0.8642 0.9338 0.8675 0.9305 0.9139
f1_score 0.8609 0.8642 0.9338 0.8675 0.9305 0.9139
roc_auc_score 0.8452 0.8196 0.9230 0.8135 0.9118 0.8990
cross_val_score 0.9229 0.8646 0.9743 0.9275 0.9704 0.9725

If you consider the overall model, The Ramdon Forest is the best model

Performance metrics

Precision: Fraction of actuals per label that were correctly classified by the model

Recall: Fraction of predictions that were correctly classified by the model

F1-score: Weighted harmonic mean of the precision and recall. F1-score: 2*(precision x recall)/(precision + recall)

Accuracy: Fraction of all observations that were correctly classified by the model

Macro avg: Calculate metrics for each label, and find their unweighted mean. This does not take label imbalance into account

Micro/weighted avg: Calculate metrics globally by counting the total true positives, false negatives and false positives

AUC Score: Given a random observation from the dataset that belongs to a class, and a random observation that doesn't belong to a class, the AUC is the perecentage of time that our model will classify which is which correctly